10/05/2019

Outline

  • General introduction to data visualization

  • Introduction to ggplot2 in R

  • Practices with ggplot2

  • Introduction to other graphics packages in R

General introduction to data visualization

When to visualize data

  • Exploratory data analysis

    • Explore patterns, trends, and distributions

    • Find correlations between variables

    • Regression analysis

  • Statistical analysis

  • Report your results

  • Communicate with non-statisticians

    • Share findings

    • Show fancy plots to your audience

What to plot

  • One variable: Histogram, Bar chart, Density plot…

  • Two variables: Scatter plot, Box plot, Violin plot…

  • Multiple variables: Heatmap…

  • Think of your data and variables carefully, and choose the most appropriate statistical plot.

Why we need data visualization

  • A better summary of statistics than tables and text.

  • Easy to show a trend or a pattern in the data.

  • A more interesting way to catch your audience’s eye.

  • For fun…

Introduction to the data set

##             TEAM    SEASON  WIN.   PTS OFFRTG DEFRTG   PACE REGION ABV
## 1  Atlanta Hawks 2015-2016 0.585 102.8  104.6  100.8  97.63   East ATL
## 2  Atlanta Hawks 2018-2019 0.354 113.3  107.5  113.1 104.56   East ATL
## 3  Atlanta Hawks 2017-2018 0.293 103.4  104.4  110.1  98.76   East ATL
## 4  Atlanta Hawks 2016-2017 0.524 103.2  104.5  105.2  97.76   East ATL
## 5 Boston Celtics 2015-2016 0.585 105.7  105.8  102.5  99.43   East BOS
## 6 Boston Celtics 2016-2017 0.646 108.0  110.6  108.0  97.21   East BOS
  • WIN.: Winning rate, which is the percentage of games played that a team has won.

  • PTS: The number of points scored.

  • OFFRTG: Offensive Rating, which measures a team’s points scored per 100 possessions.

  • DEFRTG: Defensive Rating, which is the number of points allowed per 100 possessions by a team.

  • PACE: Pace, which is the number of possessions per 48 minutes for a team.

  • REGION: East/West.

  • ABV: The abbreviation of a team.
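
To try the later examples without the original file, the six rows printed above can be rebuilt by hand. This is only a sketch: the column types (factors for SEASON and REGION) are an assumption, since the code that loads the data is not shown, and the full data set has more rows than these six.

```r
# Rebuild the six printed rows; values are copied from the output above.
# Assumption: SEASON and REGION are factors in the original data set.
nba.data <- data.frame(
  TEAM   = c(rep("Atlanta Hawks", 4), rep("Boston Celtics", 2)),
  SEASON = factor(c("2015-2016", "2018-2019", "2017-2018",
                    "2016-2017", "2015-2016", "2016-2017")),
  WIN.   = c(0.585, 0.354, 0.293, 0.524, 0.585, 0.646),
  PTS    = c(102.8, 113.3, 103.4, 103.2, 105.7, 108.0),
  OFFRTG = c(104.6, 107.5, 104.4, 104.5, 105.8, 110.6),
  DEFRTG = c(100.8, 113.1, 110.1, 105.2, 102.5, 108.0),
  PACE   = c(97.63, 104.56, 98.76, 97.76, 99.43, 97.21),
  REGION = factor(rep("East", 6), levels = c("East", "West")),
  ABV    = c(rep("ATL", 4), rep("BOS", 2))
)

# Inspect the structure: one row per team and season.
str(nba.data)
```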

What information can we get from this plot?

But “bad” plots may…

  • be hard to read if labels and legends are not clear

  • confuse people if they are not well designed

  • deliver misleading information (sometimes on purpose)

What’s the issue in this plot?

Add labels for the legend and the y axis

What’s the issue in this plot?

Compare on the same scale

How to visualize data in R?

  • graphics: The base R graphics package

  • ggplot2: The grammar of graphics

  • plotly: Interactive plots (also usable in Shiny apps)

  • leaflet: Interactive maps

graphics package in R

plot(OFFRTG ~ DEFRTG, data = nba.data);  plot(WIN. ~ SEASON, data = nba.data)

Two types of functions

  • Functions to create complete plots:

    • plot(), boxplot(), hist()…
  • Functions to add elements to an existing plot:

    • points(), lines(), legend()…
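
The two kinds of functions can be sketched as follows. The data frame `toy` holds simulated stand-in values, not real NBA data, and `abline()` is one more base function of the second kind that is not named on the slide.

```r
# Simulated stand-in values (NOT real NBA data), for illustration only.
set.seed(1)
toy <- data.frame(
  OFFRTG = runif(20, 100, 115),
  REGION = factor(rep(c("East", "West"), each = 10))
)
toy$WIN. <- 0.5 + 0.02 * (toy$OFFRTG - 107.5) + rnorm(20, sd = 0.05)

pdf(NULL)  # null device, so the sketch also runs non-interactively

# 1. A function that creates a complete plot (axes, labels, points).
east <- subset(toy, REGION == "East")
west <- subset(toy, REGION == "West")
plot(WIN. ~ OFFRTG, data = east, col = "blue", pch = 16,
     xlim = range(toy$OFFRTG), ylim = range(toy$WIN.))

# 2. Functions that add elements to the plot that is already open.
points(WIN. ~ OFFRTG, data = west, col = "red", pch = 17)
fit <- lm(WIN. ~ OFFRTG, data = toy)
abline(fit, lty = 2)  # dashed regression line over both regions
legend("topleft", legend = c("East", "West"),
       col = c("blue", "red"), pch = c(16, 17))

invisible(dev.off())
```

Calling a function of the second kind before any plot is open fails, because there is no plot to add to.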

Introduction to ggplot2

ggplot2

  • Grammar of graphics

  • Makes both quick and complex plots easy

  • Nice aesthetic settings

  • Great documentation and tons of online tutorials

An example of ggplot2

The histogram of winning rate in different regular NBA seasons and regions:

ggplot(data = nba.data, aes(x = WIN.)) + 
  geom_histogram(binwidth = 0.1, color = "black") + facet_grid(REGION ~ SEASON)

Comparison with the graphics package

Code for the same plot

par(mfrow = c(2, 4), mar = c(2, 2, 3, 1))
for(i in levels(nba.data$REGION)){
  for(j in levels(nba.data$SEASON)){
    subdata <- subset(nba.data, REGION == i & SEASON == j)
    hist(subdata$WIN., breaks = seq(0, 1, 0.1),
         main = paste(i, j, sep = ", "))
  }
}

Grammar of Graphics

  • Idea: a graph is a combination of independent building blocks.

  • Data that you want to visualise and a set of aesthetic mappings describing how variables in the data are mapped to aesthetic attributes.

  • Layers made up of geometric elements and statistical transformation. Geometric objects, geoms for short, such as points, lines, polygons, etc. Statistical transformations, stats for short, summarise data in many useful ways.

  • The scales map values in the data space to values in an aesthetic space, whether it be colour, or size, or shape.

  • A coordinate system, coord for short, describes how data coordinates are mapped to the plane of the graphic.

  • A facet describes how to break up the data into subsets and how to display those subsets as small multiples.

  • A theme which controls the finer points of display, like the font size and background colour.
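
One way to see the building blocks together is a single plot in which each line contributes one of them. This is a sketch on simulated stand-in data (`toy` is hypothetical, not the NBA data set), and the particular scale, coord, and theme choices are arbitrary.

```r
library(ggplot2)

# Simulated stand-in values (NOT real NBA data), for illustration only.
set.seed(1)
toy <- data.frame(
  SEASON = factor(rep(c("2015-2016", "2016-2017"), each = 10)),
  REGION = factor(rep(c("East", "West"), times = 10)),
  OFFRTG = runif(20, 100, 115),
  DEFRTG = runif(20, 100, 115)
)
toy$WIN. <- 0.5 + 0.03 * (toy$OFFRTG - toy$DEFRTG)

p <- ggplot(data = toy,                                     # data
            aes(x = OFFRTG, y = WIN., colour = REGION)) +   # aesthetic mappings
  geom_point() +                                            # layer: geom
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) + # layer: geom + stat
  scale_colour_manual(values = c(East = "blue", West = "red")) +  # scale
  coord_cartesian(ylim = c(0, 1)) +                         # coordinate system
  facet_wrap(~SEASON) +                                     # facet
  theme_bw(base_size = 11)                                  # theme

print(p)
```

Because the blocks are independent, any single line can be swapped or dropped without touching the others.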

The start of plotting a graph

  • ggplot() is always the first call when building a ggplot2 plot.

  • We can specify the data set and the aesthetic mappings in ggplot().

p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.))
p

Aesthetics

  • Map the variables in the data to the components in the plot

  • x: x axis

  • y: y axis

  • color: color of the boundary of a symbol

  • fill: color of the inside of a symbol

  • shape: shape of points (solid point, circle, triangle…)

  • size: size of points

  • linetype: type of lines (solid line, dashed line…)

  • …
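
The aesthetics listed above can be sketched in two ways: mapped to variables inside aes(), or set to a constant outside it. The data frame `toy` is simulated stand-in data, not the NBA data set.

```r
library(ggplot2)

# Simulated stand-in values (NOT real NBA data), for illustration only.
set.seed(1)
toy <- data.frame(
  REGION = factor(rep(c("East", "West"), times = 10)),
  OFFRTG = runif(20, 100, 115),
  DEFRTG = runif(20, 100, 115)
)
toy$WIN. <- 0.5 + 0.03 * (toy$OFFRTG - toy$DEFRTG)

# Mapped: color, shape and size vary with the data, and legends are drawn.
p_mapped <- ggplot(toy, aes(x = OFFRTG, y = DEFRTG)) +
  geom_point(aes(color = REGION, shape = REGION, size = WIN.))

# Set: the same aesthetics fixed to constants, so no legend is needed.
p_fixed <- ggplot(toy, aes(x = OFFRTG, y = DEFRTG)) +
  geom_point(color = "steelblue", shape = 17, size = 3)

print(p_mapped)
print(p_fixed)
```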

Geometries

  • Geometries are the actual graphical elements displayed in a plot. They draw the variables mapped in aes().

  • We use + to add geometries to a plot

p + geom_point()

Geometries

  • We can also specify data and aes in a geom function. They don’t have to be the same as those in ggplot().
ggplot() + geom_point(data = nba.data, aes(x = DEFRTG, y = WIN.))

geom function

  • One variable
p <- ggplot(data = nba.data, aes(x = WIN.))
p + geom_histogram(binwidth = 0.1)
p + geom_density()

geom function

  • Continuous X, continuous Y
p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.))
p + geom_point(); p + geom_line(); p + geom_density_2d(); p + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

geom function

  • Discrete X, continuous Y
p <- ggplot(data = nba.data, aes(x = SEASON, y = WIN.))
p + geom_boxplot()
p + geom_violin()

geom function

  • Plot text on the graph: aes(x, y, label)
p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.)) + 
  facet_wrap(~SEASON)
p + geom_text(aes(label = ABV), size = 2)

Multiple geom layers

ggplot(data = nba.data, aes(x = WIN.)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.1, color = "black") +
  geom_density()

Multiple geom layers

ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.)) +
  geom_point() + 
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

Multiple geom layers

ggplot(data = nba.data, aes(x = SEASON, y = WIN.)) +
  geom_violin() +
  geom_boxplot(width = 0.2)

The order of geom functions is important

ggplot(data = nba.data, aes(x = SEASON, y = WIN.)) +
  geom_boxplot(width = 0.2) +
  geom_violin()
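
Layers are drawn in the order they are added, so a later layer can cover an earlier one. The sketch below, on simulated stand-in data (`toy` and the helper `layer_order` are hypothetical), builds both orderings and lists the geoms in drawing order.

```r
library(ggplot2)

# Simulated stand-in values (NOT real NBA data), for illustration only.
set.seed(1)
toy <- data.frame(
  SEASON = factor(rep(c("2015-2016", "2016-2017"), each = 10)),
  WIN.   = runif(20, 0.2, 0.8)
)

violin_first <- ggplot(toy, aes(x = SEASON, y = WIN.)) +
  geom_violin() +
  geom_boxplot(width = 0.2)   # drawn last, so it stays visible

boxplot_first <- ggplot(toy, aes(x = SEASON, y = WIN.)) +
  geom_boxplot(width = 0.2) +
  geom_violin()               # drawn last, so it can hide the boxplot

# Hypothetical helper: name of the geom behind each layer, in drawing order.
layer_order <- function(p) {
  vapply(p$layers, function(l) class(l$geom)[1], character(1))
}

layer_order(violin_first)
layer_order(boxplot_first)
```

If the violin has to come last, giving it a transparent fill (for example `geom_violin(fill = NA)`) keeps the boxplot underneath visible.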

Data transformation

  • Sometimes we need to transform the data set to keep variables consistent with the structure of the aesthetics.

  • For instance, if we want to compare the mean winning rate between seasons and regions…

mean.win <- aggregate(WIN. ~ SEASON + REGION, FUN = mean, data = nba.data)
head(mean.win)
##      SEASON REGION      WIN.
## 1 2015-2016   East 0.4942000
## 2 2016-2017   East 0.4828667
## 3 2017-2018   East 0.4903333
## 4 2018-2019   East 0.4780667
## 5 2015-2016   West 0.5056000
## 6 2016-2017   West 0.5170667

Then we can generate a plot to compare the mean winning rate based on the new data set.
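
The plot itself is not shown here. One possible version, sketched on simulated stand-in data (`toy` is hypothetical, and the choice of points connected by lines is an assumption, not necessarily the original plot):

```r
library(ggplot2)

# Simulated stand-in values (NOT real NBA data), for illustration only.
set.seed(1)
toy <- data.frame(
  SEASON = factor(rep(c("2015-2016", "2016-2017"), each = 10)),
  REGION = factor(rep(c("East", "West"), times = 10)),
  WIN.   = runif(20, 0.2, 0.8)
)

# Same transformation as above: one mean per season and region.
mean.win <- aggregate(WIN. ~ SEASON + REGION, FUN = mean, data = toy)

# One point per mean; group = REGION connects the means of each region.
p <- ggplot(mean.win, aes(x = SEASON, y = WIN.,
                          color = REGION, group = REGION)) +
  geom_point(size = 3) +
  geom_line() +
  labs(y = "Mean winning rate")

print(p)
```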

Facet

  • Facet functions make panel plots very easy to create

  • facet_wrap wraps a 1d sequence of panels into 2d.

p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE)
p + facet_wrap(~SEASON)
## `geom_smooth()` using formula 'y ~ x'

Facet

  • facet_grid forms a matrix of panels defined by row and column faceting variables.
p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE)
p + facet_grid(REGION ~ SEASON)
## `geom_smooth()` using formula 'y ~ x'

Facet

  • You can free the scales of the x axis and the y axis.
p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE)
p + facet_grid(REGION ~ SEASON, scales = "free")
## `geom_smooth()` using formula 'y ~ x'

Facet

  • Be careful when you free your scales…

Scale

The scale functions control how the plot maps data values to the visual values of an aesthetic, for instance,

scale_x_continuous

scale_y_discrete

scale_color_gradient

scale_fill_manual

You can also specify the labels of axes or legends in the scale functions.

Scale: General Purpose Scales

  • scale_*_continuous: change the scale for a continuous variable

  • scale_*_discrete: change the scale for a discrete variable

  • scale_*_identity: use values without scaling

  • scale_*_manual: create your own discrete scale

Scale: General Purpose Scales

p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN., color = REGION)) + geom_point()
p

Scale: General Purpose Scales

p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN., color = REGION)) + geom_point()
p + scale_x_continuous(name = "offensive rate", limits = c(97, 116)) +
  scale_y_continuous(name = "winning rate", breaks= seq(0, 1, 0.1)) +
  scale_color_manual(name = "region", labels = c("EAST", "WEST"), values = c("blue", "red"))

Scales: X and Y Axis

  • General purpose scales work for the x and y axes.
p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN., color = REGION)) + geom_point()
p + scale_x_reverse()

Scale: Color & Fill
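
A minimal sketch of the two color-related scales named earlier, scale_color_gradient for a continuous variable and scale_fill_manual for a discrete one. The data frame `toy` is simulated stand-in data and the colors are arbitrary choices.

```r
library(ggplot2)

# Simulated stand-in values (NOT real NBA data), for illustration only.
set.seed(1)
toy <- data.frame(
  REGION = factor(rep(c("East", "West"), times = 10)),
  OFFRTG = runif(20, 100, 115),
  DEFRTG = runif(20, 100, 115)
)
toy$WIN. <- 0.5 + 0.03 * (toy$OFFRTG - toy$DEFRTG)

# Continuous variable mapped to color: a two-color gradient.
p1 <- ggplot(toy, aes(x = OFFRTG, y = DEFRTG, color = WIN.)) +
  geom_point() +
  scale_color_gradient(name = "Winning rate",
                       low = "grey80", high = "darkblue")

# Discrete variable mapped to fill: one hand-picked color per level.
p2 <- ggplot(toy, aes(x = WIN., fill = REGION)) +
  geom_histogram(binwidth = 0.1, color = "black") +
  scale_fill_manual(name = "Region",
                    values = c(East = "blue", West = "red"))

print(p1)
print(p2)
```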

Scale: Size, Shape and Linetype

p <- ggplot(data = nba.data, aes(x = OFFRTG, y = DEFRTG, shape = REGION)) + geom_point()
p + scale_shape_discrete("Region", solid = FALSE)

Coordinate system

  • coord_* functions control the coordinate system and its transformations
p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.)) +
  geom_point() + geom_smooth(method = "lm")
p + coord_fixed(ratio = 20); p + coord_flip(); p + coord_trans(y = "sqrt"); p + coord_polar()
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Theme

  • We can change the theme of a plot using the theme_* functions
p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.)) + 
  geom_point() + geom_smooth(method = "lm")
p + theme_bw(); p + theme_classic(); p + theme_grey(); p + theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Design your own plot

  • The labs function sets the title, subtitle, and caption of your plot.

  • The theme function is a powerful way to customize the non-data components of your plots, i.e. titles, labels, fonts, background, gridlines, and legends. See R help for details.
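
As a sketch of how labs and theme combine, the plot below sets the text with labs() and then adjusts a few non-data elements with theme(). The data frame `toy` is simulated stand-in data, and the label texts and theme settings are arbitrary examples.

```r
library(ggplot2)

# Simulated stand-in values (NOT real NBA data), for illustration only.
set.seed(1)
toy <- data.frame(
  REGION = factor(rep(c("East", "West"), times = 10)),
  OFFRTG = runif(20, 100, 115)
)
toy$WIN. <- 0.5 + 0.02 * (toy$OFFRTG - 107.5) + rnorm(20, sd = 0.05)

p <- ggplot(toy, aes(x = OFFRTG, y = WIN., color = REGION)) +
  geom_point() +
  # labs(): text of the title, subtitle, caption, axes and legend
  labs(title = "Winning rate vs. offensive rating",
       subtitle = "Simulated values",
       caption = "Illustration only",
       x = "Offensive rating", y = "Winning rate", color = "Region") +
  # theme(): appearance of the non-data components
  theme(plot.title = element_text(face = "bold", hjust = 0.5),
        legend.position = "bottom",
        panel.grid.minor = element_blank())

print(p)
```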

Arrange and save your plots

  • grid.arrange from the gridExtra package can place multiple ggplots on one page
grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

  • ggsave can save the plot to your local drive.
ggsave(p, filename = "", height = , width = , units = )
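
A filled-in call might look like the sketch below. The file name, size, and units are hypothetical example values, and the plot uses simulated stand-in data; the file is written to a temporary directory so nothing in the working directory is touched.

```r
library(ggplot2)

# Simulated stand-in values (NOT real NBA data), for illustration only.
set.seed(1)
toy <- data.frame(OFFRTG = runif(20, 100, 115))
toy$WIN. <- 0.5 + 0.02 * (toy$OFFRTG - 107.5)

p <- ggplot(toy, aes(x = OFFRTG, y = WIN.)) + geom_point()

# Hypothetical output path; the graphics device is chosen from the extension.
out <- file.path(tempdir(), "toy-scatter.pdf")
ggsave(filename = out, plot = p, width = 12, height = 8, units = "cm")

file.exists(out)
```

Naming the `plot` argument explicitly avoids relying on the last plot displayed, which is what ggsave falls back to when no plot is given.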

ggplot2 cheat sheet

Practice 1

Practice 2

Other graphical packages in R

Extensions of ggplot2

  • GGally: An extension to reduce the complexity of combining geometric objects with transformed data

  • ggExtra: A package which can add marginal density plots or histograms to ggplot2 scatterplots.

  • ggrepel: A convenient companion to geom_text() that keeps text labels from overlapping

  • gganimate: A grammar of animated graphics

  • More information: https://exts.ggplot2.tidyverse.org/gallery/

GGally

  • ggpairs: Make a matrix of plots with a given data set.
ggpairs(data = nba.data, columns = 3:7)

GGally

  • ggcorr: plot a correlation matrix (heatmap) with ggplot2
ggcorr(data = nba.data[, c(3:7)])

ggExtra

  • ggMarginal: Create a ggplot2 scatterplot with marginal density plots (default) or histograms, or add the marginal plots to an existing scatterplot.
p <- ggplot(nba.data, aes(x = OFFRTG, y = DEFRTG, color = REGION)) +
  geom_point() + theme_bw() + theme(legend.position = "bottom")
ggMarginal(p, groupColour = TRUE, groupFill = TRUE)

ggrepel

  • geom_text_repel can solve the problem of overlapping labels when we plot text on the graph.
ggplot(data = nba.data, aes(x = OFFRTG, y = DEFRTG, size = WIN.)) +
  geom_point(aes(color = REGION), shape = 1) + 
  geom_text_repel(data = subset(nba.data, WIN. >= 0.6 | WIN. <= 0.3), 
                  aes(label = ABV), size = 1.5, box.padding = 0.3) +
  facet_wrap(~SEASON)

ggrepel

  • geom_label_repel draws a rectangle underneath the text, making it easier to read.
ggplot(data = nba.data, aes(x = OFFRTG, y = DEFRTG, size = WIN.)) + 
  geom_point(aes(color = REGION), shape = 1) + 
  geom_label_repel(data = subset(nba.data, WIN. >= 0.6 | WIN. <= 0.3), 
                  aes(label = ABV), size = 1.5, box.padding = 0.3) +
  facet_wrap(~SEASON)

gganimate

ggplot(data = nba.data, aes(x = OFFRTG, y = DEFRTG, size = WIN.)) +
  geom_point(aes(color = REGION), shape = 1) + 
  geom_text_repel(aes(label = ABV), size = 1.5, box.padding = 0.3) +
  theme_bw() +
  scale_y_reverse(limits = c(120, 97)) +
  scale_color_manual(values = c("blue3", "red3")) +
  # Here comes the gganimate specific bits
  labs(title = 'SEASON: {closest_state}', x = 'OFFRTG', y = 'DEFRTG') +
  theme(title = element_text(size = 5), 
        text = element_text(size = 2)) +
  transition_states(SEASON,
                    transition_length = 2,
                    state_length = 1)

Summary

  • Introduction to data visualization

  • Grammar of graphics: ggplot2

  • Practices with basketball game data

  • Extensions of ggplot2

Thanks for listening!